Statistical data preparation: management of missing values and outliers
Authors
Abstract
Missing values and outliers are frequently encountered during data collection. Missing values reduce the amount of data available for analysis, which compromises the statistical power of the study and, ultimately, the reliability of its results. They can also introduce significant bias into the results and degrade the efficiency of the data. Outliers strongly affect the estimation of statistics (e.g., the mean and standard deviation of a sample), producing overestimated or underestimated values. The results of data analysis therefore depend considerably on how missing values and outliers are handled. This review discusses the types of missing values, ways of identifying outliers, and methods for dealing with both.
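The effect described in the abstract can be illustrated with a minimal sketch. The sample values below are hypothetical, and the 1.5 × IQR fence (Tukey's rule) is only one common way of flagging outliers, chosen here as an assumption; the abstract does not name a specific method. The helper `iqr_outliers` is likewise an illustrative name, not something taken from the review.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample: None marks a missing value, 250.0 is an implausible entry.
raw = [12.0, 14.5, None, 13.2, 15.1, 250.0, 12.8, None, 14.0]

# Dropping missing values (listwise deletion) shrinks the usable sample.
observed = [v for v in raw if v is not None]

# Flag outliers, then compare the statistics with and without them.
flagged = iqr_outliers(observed)
cleaned = [v for v in observed if v not in flagged]

print(f"recorded: {len(raw)}, observed: {len(observed)}, flagged: {flagged}")
print(f"mean  with outliers: {statistics.mean(observed):.2f}, "
      f"without: {statistics.mean(cleaned):.2f}")
print(f"stdev with outliers: {statistics.stdev(observed):.2f}, "
      f"without: {statistics.stdev(cleaned):.2f}")
```

In this sketch a single extreme entry inflates both the mean and the standard deviation, while the two missing entries reduce the number of usable observations. Whether a flagged value should be removed, corrected, or kept is a judgment that depends on the study.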
Related articles
A Perception of Statistical Inference in Data Mining
Data mining is concerned with learning from data; therefore, completeness, quality, and real-world data preparation are key prerequisites of successful data mining, whose aim is to discover something new from the facts already recorded in a given database. Preparation of data is a fundamental stage of data analysis. During data preparation, the major problem occurs due to missin...
Analysis of Missing Value Estimation Algorithms for Data Farming
In this paper we compare various statistical methods for estimating missing data values. Missing data estimation is a part of data farming. Data farming is a process of growing the data that provides a more comprehensive understanding of the possible outcomes and offers the opportunity to discover outliers and surprises. Data mining tasks often use existing data collected for various other purpose...
Performance evaluation of different estimation methods for missing rainfall data
There are numerous methods for estimating missing values, some of which are used depending on the data type and regional climatic characteristics. In this research, part of the monthly precipitation data from the Sarab synoptic station, East Azerbaijan province, Iran, was randomly treated as missing. In order to study the effectiveness of various methods to estimate missing data, by seven classic s...
The identification, impact and management of missing values and outlier data in nutritional epidemiology.
When performing nutritional epidemiology studies, missing values and outliers inevitably appear. Missing values appear, for example, because of the difficulty in collecting data in dietary surveys, leading to a lack of data on the amounts of foods consumed or a poor description of these foods. Inadequate treatment during the data processing stage can create biases and loss of accuracy and, cons...
A method to solve the problem of missing data, outlier data and noisy data in order to improve the performance of human and information interaction
Abstract Purpose: Errors in data collection and failure to pay attention to data that are noisy in the collection process for any reason cause problems in data-based analysis and, as a result, wrong decision-making. Therefore, solving the problem of missing or noisy data before processing and analysis is of vital importance in analytical systems. The purpose of this paper is to provide a metho...